METR

Model Evaluation & Threat Research

METR is a research nonprofit that evaluates frontier AI models to help companies and wider society understand AI capabilities and the risks they pose.

METR researches, develops, and runs cutting-edge tests of AI capabilities, including broad autonomous capabilities and the ability of AI systems to accelerate AI R&D. We also study AI behavior that could threaten the integrity of evaluations, as well as mitigations for such behavior.

GPT-5.1-Codex-Max Evaluation Results

We evaluate whether GPT-5.1-Codex-Max poses significant catastrophic risks via AI self-improvement or rogue replication. We conclude that this seems unlikely.

Read More
Measuring AI Ability to Complete Long Tasks

We propose measuring AI performance in terms of the length of tasks AI agents can complete. We show that this metric has increased exponentially and consistently over the past six years, with a doubling time of around seven months. Extrapolating this trend predicts that, in under five years, we will see AI agents that can independently complete a large fraction of software tasks that currently take humans days or weeks (a rough extrapolation sketch follows below).

Read More
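
The headline figures above (a roughly seven-month doubling time and a five-year extrapolation) can be sanity-checked with simple compounding. The sketch below is illustrative only and is not METR's code: the one-hour starting horizon is an assumed placeholder, not a value from the paper.

```python
# Illustrative sketch only (not METR's code): how a ~7-month doubling time in
# agent task-length "horizon" compounds over five years. The starting horizon
# of 1 hour is a placeholder assumption, not a figure from the paper.

DOUBLING_TIME_MONTHS = 7.0   # doubling time reported for the task-length metric
START_HORIZON_HOURS = 1.0    # hypothetical: an agent that completes ~1-hour tasks today


def horizon_after(months: float,
                  start_hours: float = START_HORIZON_HOURS,
                  doubling_months: float = DOUBLING_TIME_MONTHS) -> float:
    """Task length (in hours) reachable after `months` of exponential growth."""
    return start_hours * 2 ** (months / doubling_months)


if __name__ == "__main__":
    for years in (1, 2, 3, 4, 5):
        hours = horizon_after(12 * years)
        print(f"after {years} yr: ~{hours:,.0f} hours (~{hours / 40:.1f} 40-hour work-weeks)")
```

Under these assumptions, the horizon reaches roughly 380 hours (about nine to ten work-weeks) after five years, which is what makes the "days or weeks" extrapolation plausible if the trend continues.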
Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity

We conduct a randomized controlled trial (RCT) to understand how early-2025 AI tools affect the productivity of experienced open-source developers working on their own repositories. Surprisingly, we find that developers take 19% longer when using AI tools than when working without them; the AI tools make them slower.

Read More
RE-Bench — Benchmark and Paper Tracking Automation of AI R&D

Measuring the performance of humans and AI agents on day-long ML research engineering tasks

Read More
Measuring Autonomous AI Capabilities — Resource Collection

An index of our research and guidance on how to measure AI systems’ ability to autonomously complete a wide range of multi-hour tasks

Read More
Frontier AI Safety Policies — Index and Resources

An index of AI companies’ frontier safety policies, which are intended to evaluate and manage severe AI risks

Read More

Evaluation Reports

We have worked with companies such as Anthropic and OpenAI to conduct preliminary evaluations of the autonomous capabilities of several frontier AI models. We do this both to understand the capabilities of frontier models and to pilot third-party evaluator arrangements. (We do not accept compensation for this work.) We also occasionally evaluate models independently after they are released, without involvement from the models’ developers. Recent public reports from this work are listed below, with additional discussion in the respective system cards.

Partnerships

We partner with AI developers such as Anthropic and OpenAI to conduct evaluations of the autonomous capabilities of frontier AI models. We do this both to understand the models’ capabilities and to pilot third-party evaluator arrangements.

METR does not accept monetary compensation from model developers for this work, but companies including OpenAI and Anthropic have provided model access and free compute credits to support our evaluation research. We often use this access and these credits to continue evaluating models independently after they are released, without involvement from the models’ developers.

We are also partnering with the AI Security Institute and are part of the NIST AI Safety Institute Consortium.

Media Coverage

Recent Updates